Welcome




Developers Conference

#### Data Prefetching

Kalpesh Gala and Jim Robertson



#### Outline

- Motivation/Overview
- Overview of Prefetch Instructions
- Instruction Format and Arguments
- Memory Access Example
- Stopping Prefetching
- Other Considerations





#### Motivation

- By prefetching the data BEFORE it is needed:
  - The data will be in the local (L1) cache
  - The page table entry will be in the TLB when the memory access occurs
  - Load/store instructions accessing memory execute quickly (no bus access)



#### Overview

- G4 supports software-directed prefetch
- Uses idle bus cycles to load data into cache before it is needed
- When the load/store instructions are actually executed, data is in cache
- "Data Stream Touch" instructions control software-directed prefetch



#### Prefetch Instructions

- Four instructions initiate software prefetching
  - dst: Data Stream Touch
  - dstt: Data Stream Touch Transient (used for last access)
  - dstst: Data Stream Touch-for-Store (should not be used)
  - dststt: Data Stream Touch-for-Store Transient (should not be used)



#### Prefetch Instructions (Cont.)

- Transient instructions indicate the data does not have a long lifetime
  - Transient data will not be cast out to the L2 for future use
  - Modified data is written directly to the memory, unmodified data is discarded
- Touch-for-Store instructions mark data as exclusive



#### dstX Instruction Format

- Usage: dstX rA, rB, STRM
  - dstX is one of dst, dstt, dstst, or dststt
  - **rA** is the address of the first block to prefetch
  - **rB** encodes the Block Size, Block Count and Stride (bit 0 is the most significant bit)
  - **STRM** is the stream to use, 0-3

| Bits 0-2 | Bits 3-7   | Bits 8-15   | Bits 16-31    |
|----------|------------|-------------|---------------|
| 000      | Block Size | Block Count | Signed Stride |
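
The field layout above can be turned into a small packing helper. This is a minimal sketch rather than code from the presentation: the function name dst_control_word is hypothetical, and it assumes that a Block Size of 32 and a Block Count of 256 are encoded as all-zero fields, since they do not fit in 5 and 8 bits respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the rB control word for a dstX instruction (hypothetical helper).
 * Field positions follow the slide, with bit 0 as the most significant bit:
 *   bits 0-2   : zero
 *   bits 3-7   : block size, in 16-byte quad words
 *   bits 8-15  : block count
 *   bits 16-31 : signed stride, in bytes
 * Assumption: size 32 and count 256 wrap to an all-zero field. */
static uint32_t dst_control_word(unsigned size, unsigned count, int stride)
{
    /* Ranges as stated on the argument slides. */
    assert(size >= 1 && size <= 32);
    assert(count >= 1 && count <= 256);
    assert(stride != 0 && stride > -32768 && stride < 32768);

    return ((uint32_t)(size & 0x1Fu) << 24)    /* bits 3-7   */
         | ((uint32_t)(count & 0xFFu) << 16)   /* bits 8-15  */
         | ((uint32_t)stride & 0xFFFFu);       /* bits 16-31, two's complement */
}
```

For example, 2 quad words per block, 4 blocks and a 32-byte stride pack to 0x02040020.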



#### dstX Arguments

- Stream ID: which stream engine to use
  - There are four stream engines, 0-3
- Address: initial address of the sequence
- Block Size: the number of quad words (16 bytes each) in each block
  - Block Size is between 1 and 32
  - Should be at least 2 to fill a single 32-byte G4 cache line



#### dstX Arguments (Cont.)

- Count: number of blocks in the sequence
  - Count is between 1 and 256
- Stride: number of bytes between the starts of consecutive blocks
  - Valid stride values: -32768 < stride < 0 or 0 < stride < 32768
  - To avoid redundant loads, choose a stride so that consecutive blocks do not overlap (a stride magnitude of at least Block Size × 16 bytes)



#### dst Memory Access
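
The access pattern implied by the dstX arguments can be modeled in plain C. The sketch below is illustrative only: the names dst_block, dst_blocks and dst_blocks_overlap are hypothetical, and the model ignores cache-line alignment and address translation. Block i is assumed to start at Address + i × Stride and to span Block Size × 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* One block of a stream: start address and length in bytes. */
struct dst_block {
    uint32_t start;
    uint32_t length;
};

/* Fill out[] with the blocks a stream would touch, in fetch order.
 * Returns the number of entries written (at most max). */
static size_t dst_blocks(uint32_t addr, unsigned size, unsigned count,
                         int stride, struct dst_block *out, size_t max)
{
    size_t n = 0;
    for (unsigned i = 0; i < count && n < max; i++) {
        /* A negative stride walks downward; unsigned arithmetic wraps. */
        out[n].start = addr + (uint32_t)((int32_t)i * stride);
        out[n].length = size * 16u;   /* quad words to bytes */
        n++;
    }
    return n;
}

/* Nonzero when consecutive blocks overlap, i.e. the stride is shorter
 * than one block, so some bytes would be requested more than once. */
static int dst_blocks_overlap(unsigned size, int stride)
{
    unsigned span = size * 16u;
    unsigned step = (unsigned)(stride < 0 ? -stride : stride);
    return step < span;
}
```

With a 32-byte block (Block Size 2) and a 64-byte stride, every other cache line is touched; with Block Size 4 and a 32-byte stride, each block re-covers half of the previous one.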



#### dst Termination

- Two instructions let software stop dst streams
  - dss (Data Stream Stop): stops a single stream
  - dssall (Data Stream Stop All): stops ALL active streams



#### dst Termination (Cont.)

- Prefetching may also terminate for any of the following reasons:
  - Successfully reached end of stream
  - Another dst instruction to the same stream is executed
  - Current line-fetch causes a table walk which results in a page table miss
  - Current line-fetch is translated as cache inhibited





#### dst Termination (Cont.)

- Software cannot tell whether a stream has stopped fetching
  - Re-issue dst instructions periodically in case the stream was terminated early
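
One way to follow this advice is to restart the stream at every chunk boundary of a loop, so that an early termination costs at most one chunk of prefetching. The sketch below is an assumption-laden illustration, not code from the presentation: it uses the vec_dst and vec_dss intrinsics from altivec.h when the compiler defines __ALTIVEC__ and compiles to a plain loop otherwise. The chunk size and the names sum_with_prefetch, PREFETCH and PREFETCH_STOP are hypothetical.

```c
#include <stddef.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
/* The stream number must be a literal constant; stream 0 is used here. */
#define PREFETCH(p, ctl) vec_dst((p), (ctl), 0)
#define PREFETCH_STOP()  vec_dss(0)
#else
/* No AltiVec: the prefetch hints become no-ops. */
#define PREFETCH(p, ctl) ((void)(p), (void)(ctl))
#define PREFETCH_STOP()  ((void)0)
#endif

enum { CHUNK_FLOATS = 256 };   /* 1 KB per chunk, a hypothetical tuning value */

/* Sum an array while prefetching one chunk ahead of the loads. */
static float sum_with_prefetch(const float *data, size_t n)
{
    /* Control word: 16 quad words per block, 4 blocks, 256-byte stride,
     * which covers exactly one 1 KB chunk with no overlap. */
    const int ctl = 0x10040100;
    float total = 0.0f;

    for (size_t base = 0; base < n; base += CHUNK_FLOATS) {
        size_t end = (base + CHUNK_FLOATS < n) ? base + CHUNK_FLOATS : n;

        /* Re-issue the touch for the next chunk. A new dst on the same
         * stream replaces the old one, so a stalled stream is restarted.
         * The hint may run past the end of a final partial chunk. */
        if (end < n)
            PREFETCH(data + end, ctl);

        for (size_t i = base; i < end; i++)
            total += data[i];
    }

    PREFETCH_STOP();   /* stop stream 0 once the data is consumed */
    return total;
}
```

Because the prefetch is only a hint, the result is the same with or without AltiVec support; only the timing changes.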





#### Other Considerations

- Prefetching is context aware
  - Prefetching is paused if the processor switches from user to supervisor mode
  - Prefetching resumes when switching back to user mode
  - This prevents prefetching from happening during exceptions



#### Other Considerations (Cont.)

- No arbitrary address boundaries which stop the progress of a stream
- dstX instructions handle address alignment issues automatically
- All four prefetch engines can be active at the same time



#### Conclusion

- Software prefetching can be a useful tool for increasing performance
- dstX instructions are directly supported by the AltiVec programming model
- Use transient versions of dstX for the last access of the data
- Avoid using the touch-for-store variants of dstX





#### Sim\_G4 in Depth & Details

99 Worldwide Developers Conference

Kalpesh Gala and Jim Robertson



#### Presentation Summary

- Introduction
- Methodology & Tool Flow
- Configuring Sim\_G4
- 64-bit Multiply Example
- Conclusion



#### Introduction

- Sim\_G4 is a trace-driven, cycle-accurate timing simulator
- Sim\_G4 was developed by Motorola's PowerPC G4 Design Team for architectural decisions
- Limitations
  - No notion of data dependencies
  - Results for most applications are 95-100% accurate



#### Methodology & Tool Flow

- Code the desired algorithm/application (you)
- Compile (MW, MPW, MrC, mcc)
- Execute application / generate trace (pitsTT6)
- Simulate operation on G4 Processor (Sim\_G4)
- Analyze simulation results (you)
- Optimize (you)



High Performance Code!!! = \$\$\$



#### Sim\_G4 Configuration

- Command Line Options
- Simulation Parameters
- Current Sim\_G4 Defaults
  - Processor is a 300 MHz G4
  - System bus running at 75 MHz
  - System is a PowerMac G3



#### Output Configuration

Configuration menu: Command Line Options (⌘L), Simulator Parameters (⌘P)

A variety of control parameters can be set through the Command Line Options popup window

| Option                      | Description                                                        |
|-----------------------------|--------------------------------------------------------------------|
| **General Options**         |                                                                    |
| -dp                         | displays progress every N clocks (default is 1,000,000)            |
| -be                         | enable BAT registers                                               |
| -oe                         | specifies file to send error messages (default is stdout)          |
| **Run-time Parameters**     |                                                                    |
| -irf                        | specifies a file containing run-time parameter settings            |
| **Pipeline Display Output** |                                                                    |
| -op                         | specifies file for pipeline display (default is stdout)            |
| -p                          | enable detailed MSS pipeline status                                |
| -pm                         | enable detailed MSS pipeline status (same as -p)                   |
| -pc                         | enable detailed core pipeline status                               |
| -pv                         | enable detailed AltiVec pipeline status                            |
| **Scrolling Pipeline**      |                                                                    |
| -sp                         | specifies file for pipeline display (default is stdout)            |
| -st                         | set scrollpipe type: 0 = Horizontal, 1 = Vertical, 2 = Wide Vertic |
| -sw                         | set scrollpipe display to X characters wide (horizontal only)      |
| -sh                         | print scrollpipe format, then exit                                 |
| **Output Control**          |                                                                    |



#### System Configuration

A variety of system configuration variables can be set through the Simulation Parameters popup window

| Parameter                | Value  | Description                                                         |
|--------------------------|--------|---------------------------------------------------------------------|
| **60x Bus**              |        |                                                                     |
| bus_mode                 | 60xBus | External Bus Mode (0 = 60xBus, 1 = native bus)                      |
| g4_bus_fraction_numerato | 4      | Number of internal clocks per 1 or 2 bus clocks                     |
| g4_bus_fraction_denomin  | 1      | 1 => numerator == full processor clocks, 2 => half-processor clocks |
| **L2 Interface**         |        |                                                                     |
| l2_bus_fraction_numerato | 2      | Number of internal clocks per 1 or 2 L2 bus clocks                  |
| l2_bus_fraction_denomina |        | 1 => numerator == full processor clocks, 2 => half-processor clocks |
| l2_size                  | 1024   | Size of the L2 (in K bytes)                                         |
| l2_disable               |        | Disables the L2 Cache                                               |
| l2_sram_latency          | 3      | Latency for the first beat of an access to L2 SRAMs (max: 10)       |
| **Memory Interface**     |        |                                                                     |
| mem_controller           | MPC106 | External Memory Controller (1=MPC106, 2=NextGen)                    |
| dram_type                | SDRAM  | Type of DRAM (1=EDO, 2=SDRAM)                                       |
| dram_row_bits            | 12     | Number of address bits assigned to DRAM Row address                 |
| dram_col_bits            | 10     | Number of address bits assigned to DRAM Column address              |
| sdram_bank_bits          | 1      | Number of address bits used to index SDRAM device banks             |
| sdram_close_latency      | 2      | Latency associated with the closing of a SDRAM device bank          |
| sdram_hit_latency        | 5      | Latency caused by a hit to an open SDRAM device bank                |
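
As a quick check on the bus settings, assume the two g4_bus_fraction parameters give the number of processor clocks per bus clock as numerator over denominator (an interpretation of the parameter descriptions, not something the slides state outright). The values shown, 4 and 1, then reproduce the default clock speeds listed earlier:

$$
f_{\text{bus}} = f_{\text{core}} \cdot \frac{\text{denominator}}{\text{numerator}} = 300\ \text{MHz} \cdot \frac{1}{4} = 75\ \text{MHz}
$$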



#### Processing the Trace

- Assumption: a TT6 trace file has been generated as input to Sim\_G4
- After invoking the application and selecting the desired configuration, an input TT6 file must be designated

| File menu item         | Shortcut |
|------------------------|----------|
| Open TT6 File          | ⌘O       |
| Use Apple Event Stream | ⌘A       |
| Close Window           | ⌘W       |
| Save Results           | ⌘S       |
| Print Results          | ⌘P       |
| Quit                   | ⌘Q       |



#### Sim\_G4 the Last Step...

- After configuring and supplying the desired trace, Sim\_G4 provides profiling information



Sim\_G4 results window (Sample.tt6.out), legible summary values:

- Clocks: 1510, Retired: 77, Folded: 5, IPC = 0.0543
- Instruction Flow Stats: Fetched: 82, Dispatched: 79, Retired: 77, Branches Folded: 5
- IB_empty: 90.53% (1367), CB_full: 0.00% (0), GPR_rename: 0.00% (0), Unit_busy: 3.84% (58)
- Execution Unit Stats: FXU1 idle 98.81%, dispatch 17; FXU2 idle 99.87%, dispatch 1
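
The reported IPC is consistent with counting folded branches along with retired instructions. This reading is inferred from the numbers in the sample output, not from Sim\_G4 documentation:

$$
\text{IPC} = \frac{\text{Retired} + \text{Folded}}{\text{Clocks}} = \frac{77 + 5}{1510} \approx 0.0543
$$

The same value follows from Fetched / Clocks = 82 / 1510, so the sample alone cannot distinguish the two definitions.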





#### Demo

#### Conclusion

- Sim\_G4 models the G4 Architecture NOT just the AltiVec engine
- Sim\_G4 can be useful in fine-tuning pieces of code
- Analysis of memory-intensive applications may not reflect actual system behavior
- Sim\_G4 is not always 100% accurate



#### Call to Action

- Download the SDK: developer.apple.com/hardware/altivec
- www.mot.com/AltiVec
- Identify data parallelism in your programs
- Vectorize computation-intensive code
- Use Sim\_G4 to tune performance
- Sign up for the next AltiVec kitchen



#### Other AltiVec Sessions

- AltiVec Workshops: Hands-on introduction to AltiVec (pre-registration only). Room L, Fri.
- AltiVec Feedback Session: Open Q&A session. Hall J2, Fri., 10:15am









#### Think different.




